CMA-ES for Hyperparameter Optimization of Deep Neural Networks
Authors
Ilya Loshchilov, Frank Hutter
Abstract
Hyperparameters of deep neural networks are often optimized by grid search, random search or Bayesian optimization. As an alternative, we propose to use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is known for its state-of-the-art performance in derivative-free optimization. CMA-ES has some useful invariance properties and is friendly to parallel evaluations of solutions. We provide a toy example comparing CMA-ES and state-of-the-art Bayesian optimization algorithms for tuning the hyperparameters of a convolutional neural network for the MNIST dataset on 30 GPUs in parallel.

Hyperparameters of deep neural networks (DNNs) are often optimized by grid search, random search (Bergstra & Bengio, 2012) or Bayesian optimization (Snoek et al., 2012a; 2015). For the optimization of continuous hyperparameters, Bayesian optimization based on Gaussian processes (Rasmussen & Williams, 2006) is known as the most effective method. While for joint structure search and hyperparameter optimization, tree-based Bayesian optimization methods (Hutter et al., 2011; Bergstra et al., 2011) are known to perform better (Bergstra et al.; Eggensperger et al., 2013; Domhan et al., 2015), here we focus on continuous optimization. We note that integer parameters with rather wide ranges (e.g., the number of filters) can, in practice, be considered to behave like continuous hyperparameters.

As the evaluation of a DNN hyperparameter setting requires fitting a model and evaluating its performance on validation data, this process can be very expensive, which often renders sequential hyperparameter optimization on a single computing unit infeasible. Unfortunately, Bayesian optimization is sequential by nature: while a certain level of parallelization is easy to achieve by conditioning decisions on expectations over multiple hallucinated performance values for currently running hyperparameter evaluations (Snoek et al., 2012a) or by evaluating the optima of multiple acquisition functions concurrently (Hutter et al., 2012; Chevalier & Ginsbourger, 2013; Desautels et al., 2014), perfect parallelization appears difficult to achieve since the decisions in each step depend on all data points gathered so far.

Here, we study the use of a different type of derivative-free continuous optimization method in which parallelism is allowed by design. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES; Hansen & Ostermeier, 2001) is a state-of-the-art optimizer for continuous black-box functions. While Bayesian optimization methods often perform best for small function evaluation budgets (e.g., below 10 times the number of hyperparameters being optimized), CMA-ES tends to perform best for larger function evaluation budgets; for example, Loshchilov et al. (2013) showed that CMA-ES performed best among more than 100 classic and modern optimizers on a wide range of black-box functions. CMA-ES has also been used for hyperparameter tuning before, e.g., for tuning its own Ranking SVM surrogate models (Loshchilov et al., 2012) or for automatic speech recognition (Watanabe & Le Roux, 2014). In a nutshell, CMA-ES is an iterative algorithm that, in each of its iterations, samples λ candidate solutions from a multivariate normal distribution, evaluates these solutions (sequentially or in parallel), and then adjusts the sampling distribution used for the next iteration to give higher probability to good samples. (Since space restrictions disallow a full description of CMA-ES, we refer to Hansen & Ostermeier (2001) for details.)
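To make this sample-evaluate-adapt loop concrete, here is a minimal sketch using the pycma reference implementation (the `cma` package). The objective `validation_error`, the two log10-scaled hyperparameters, and the toy quadratic standing in for an actual training run are illustrative assumptions, not the exact setup used in the paper.

```python
# Minimal CMA-ES hyperparameter-tuning loop (sketch, not the paper's exact setup).
# Requires the reference implementation: pip install cma
import cma

def validation_error(x):
    """Hypothetical objective: decode hyperparameters from the CMA-ES search
    space (log10 scales here), train the DNN, and return the validation error
    to be minimized.  A cheap quadratic stands in for the expensive training
    run so the sketch runs end to end."""
    learning_rate = 10.0 ** x[0]   # e.g. x[0] roughly in [-6, 0]
    l2_weight = 10.0 ** x[1]       # e.g. x[1] roughly in [-8, -2]
    # placeholder for: train the network, evaluate on the validation set
    return (x[0] + 3.0) ** 2 + (x[1] + 5.0) ** 2

# Initial mean and step size of the sampling distribution; 'popsize' is lambda.
es = cma.CMAEvolutionStrategy([-3.0, -5.0], 1.0, {'popsize': 10})

while not es.stop():
    candidates = es.ask()                                 # sample lambda solutions
    errors = [validation_error(x) for x in candidates]    # evaluate (sequentially or in parallel)
    es.tell(candidates, errors)                           # adapt mean, step size, covariance
    es.disp()

print("best hyperparameters found:", es.result.xbest)
```

Searching in log10 space is a common choice for scale-sensitive hyperparameters such as learning rates, since it lets the Gaussian sampling distribution cover several orders of magnitude evenly.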
Usual values for the so-called population size λ are around 10 to 20; in the study we report here, we used a larger population size λ = 30 to take full advantage of the 30 GeForce GTX TITAN Black GPUs we had available. Larger values of λ are also known to be helpful for noisy optimization problems.
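Because the λ candidate solutions of one iteration are mutually independent, their expensive evaluations can simply be farmed out to λ workers, which is how a 30-GPU setup like the one above can be exploited. The sketch below illustrates that pattern with a process pool; the cheap placeholder objective and the worker-to-GPU binding (e.g., via CUDA_VISIBLE_DEVICES, not shown) are assumptions for illustration, not the paper's implementation.

```python
# Parallel evaluation of one CMA-ES generation (illustrative sketch).
import multiprocessing as mp
import cma

def validation_error(x):
    # placeholder for training and validating one DNN configuration on one GPU
    return sum((xi + 1.0) ** 2 for xi in x)

if __name__ == "__main__":
    lam = 30                                   # population size = number of GPUs/workers
    es = cma.CMAEvolutionStrategy([0.0] * 5, 1.0, {'popsize': lam})
    with mp.Pool(processes=lam) as pool:
        while not es.stop():
            candidates = es.ask()
            # all lambda evaluations of this generation run concurrently
            errors = pool.map(validation_error, candidates)
            es.tell(candidates, errors)
            es.disp()
    print("best solution:", es.result.xbest)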
Similar Papers
Practical Hyperparameter Optimization
Recently, the bandit-based strategy Hyperband (HB) was shown to yield good hyperparameter settings of deep neural networks faster than vanilla Bayesian optimization (BO). However, for larger budgets, HB is limited by its random search component, and BO works better. We propose to combine the benefits of both approaches to obtain a new practical state-of-the-art hyperparameter optimization metho...
Bayesian Neural Networks for Predicting Learning Curves
The performance of deep neural networks (DNNs) crucially relies on good hyperparameter settings. Since the computational expense of training DNNs renders traditional blackbox optimization infeasible, recent advances in Bayesian optimization model the performance of iterative methods as a function of time to adaptively allocate more resources to promising hyperparameter settings. Here, we propos...
Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves
Deep neural networks (DNNs) show very strong performance on many machine learning problems, but they are very sensitive to the setting of their hyperparameters. Automated hyperparameter optimization methods have recently been shown to yield settings competitive with those found by human experts, but their widespread adoption is hampered by the fact that they require more computational resources...
Hyperparameter Transfer Learning through Surrogate Alignment for Efficient Deep Neural Network Training
Recently, several optimization methods have been successfully applied to the hyperparameter optimization of deep neural networks (DNNs). The methods work by modeling the joint distribution of hyperparameter values and corresponding error. Those methods become less practical when applied to modern DNNs whose training may take a few days and thus one cannot collect sufficient observations to accu...
Efficient Hyperparameter Optimization for Deep Learning Algorithms Using Deterministic RBF Surrogates
Automatically searching for optimal hyperparameter configurations is of crucial importance for applying deep learning algorithms in practice. Recently, Bayesian optimization has been proposed for optimizing hyperparameters of various machine learning algorithms. Those methods adopt probabilistic surrogate models like Gaussian processes to approximate and minimize the validation error function o...
Journal:
CoRR, volume abs/1604.07269
Publication year: 2016